1 Executive Summary

This report explores the pricing dynamics of Airbnb listings in Sydney by applying machine learning classification models to predict property price categories, Budget (<$100/night), MidMarket ($100-$200/night) and Premium (>$200/night), using property characteristics, location data and host information. Our analysis addresses key Australian housing market challenges while providing actionable insights for the tourism and rental property sectors. We begin with detailed data cleaning, including formatting fixes, handling of missing values and treatment of outliers. Exploratory data analysis highlights geographic clustering of premium properties around Sydney Harbour and the CBD, while budget options are spread towards the outer suburbs. Overall, this study demonstrates how data-driven classification can uncover meaningful patterns in Airbnb pricing, supporting more informed decision-making across the platform’s ecosystem.


2 Problem Definition

2.1 Research Question

Can we predict whether a Sydney Airbnb property will be classified as Premium (>$200/night), MidMarket ($100-200/night), or Budget (<$100/night) based on property characteristics, location and host factors?

2.2 Classification Problem Framework

This project focuses on a multi-class classification problem with three distinct target categories:

  • Premium: Properties >$200/night (luxury market segment)
  • MidMarket: Properties $100-200/night (mainstream market)
  • Budget: Properties <$100/night (budget-conscious travelers)

The classification approach enables predictive insights for property investors, market segmentation analysis for tourism planning, and pricing strategy guidance for potential hosts.

2.3 Business Rationale for Price Categorization

While property prices exist on a continuous scale, converting them into discrete market segments provides substantial practical and strategic value for multiple stakeholders:

1. Consumer Decision-Making and Search Behavior

Travelers typically approach accommodation search with a budget category in mind rather than exact price points. The three-tier classification reflects natural consumer behavior patterns where users mentally categorize options as “budget-friendly,” “mid-range,” or “luxury” before drilling down into specific listings. This categorization mirrors common filtering mechanisms on booking platforms.

2. Investment and Portfolio Strategy

Property investors require clear market positioning to guide acquisition and renovation decisions. Determining whether a property will command Budget, MidMarket, or Premium rates directly informs:

  • Renovation budget allocation and expected ROI
  • Target demographic and marketing positioning
  • Competitive positioning within specific neighborhoods
  • Risk assessment for new property investments

3. Regulatory and Policy Applications

Australian housing policy and short-term rental regulations often distinguish between different accommodation tiers. Premium properties may face different regulatory scrutiny regarding their impact on long-term housing availability compared to budget options. Classification models can inform evidence-based policy decisions about short-term rental impacts on housing affordability.

4. Market Segmentation and Pricing Strategy

Hosts benefit from understanding which category their property naturally falls into based on structural features, location, and amenities. Rather than marginally adjusting a continuous price, hosts can make strategic decisions about whether feature upgrades would move their property into a higher tier, fundamentally changing their market position and revenue potential.

5. Tourism Planning and Economic Analysis

Sydney’s tourism industry and economic planners require segmented accommodation data to understand market composition. Classification reveals whether the city has adequate budget options for students and backpackers, sufficient mid-market options for families, and appropriate luxury inventory for high-spending tourists. This information guides tourism infrastructure planning and economic development strategies.

6. Statistical and Modeling Considerations

From an analytical perspective, discrete categories reduce the impact of measurement noise in self-reported nightly rates, handle non-linear relationships between features and price tiers more effectively than linear regression assumptions, and provide clearer, more actionable insights than continuous predictions with confidence intervals.
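To make the noise-robustness point concrete: a small error in a self-reported nightly rate usually leaves the assigned tier unchanged, whereas it always changes a continuous target. A minimal sketch (the `price_to_tier` helper is illustrative, mirroring the thresholds used throughout this report):

```r
# Illustrative helper mirroring the report's price thresholds
price_to_tier <- function(p) {
  cut(p,
      breaks = c(0, 100, 200, Inf),
      labels = c("Budget", "MidMarket", "Premium"),
      include.lowest = TRUE)
}

reported <- c(85, 150, 250)          # listed nightly rates
noisy    <- reported + c(5, -8, 10)  # small self-reporting noise

# The tier assignment is identical for all three listings despite the noise
identical(price_to_tier(reported), price_to_tier(noisy))
```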

This classification framework transforms a continuous prediction problem into an actionable decision support tool, providing clear categorical predictions that align with how stakeholders actually use pricing information in real-world decisions.

3 Data Description

The Sydney Airbnb Listings dataset contains detailed information on over 18,000 listings across the city, with 79 variables describing property characteristics, host details, geographic location, availability and customer engagement. Key attributes include listing identifiers, host information, neighbourhoods, room type, number of reviews, minimum nights, availability and pricing. For the purpose of this study, the focus is on the price variable, which has been cleaned to remove formatting and extreme outliers, and subsequently transformed into a categorical target variable representing three market segments (Inside Airbnb, 2025; Cox, 2024).

# Load required libraries
library(tidyverse)      # Data manipulation and visualization
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
library(VIM)           # Missing data visualization
library(corrplot)      # Correlation plots
## corrplot 0.95 loaded
library(ggplot2)       # Advanced plotting
library(dplyr)         # Data manipulation
library(readr)         # Reading CSV files
library(stringr)       # String manipulation
library(plotly)        # Interactive plots
library(gridExtra)     # Multiple plots
library(scales)        # Scale formatting
library(knitr)         # Table formatting
library(DT)            # Interactive tables
library(MLmetrics)     # Machine learning metrics
library(pROC)          # ROC curve analysis

3.1 Data Source

Primary Dataset: Inside Airbnb Sydney Listings

Reference: Inside Airbnb. (2025). Sydney, New South Wales, Australia Dataset. Retrieved from http://insideairbnb.com/get-the-data/. Data sourced from publicly available information from Airbnb.com. Murray Cox, Inside Airbnb Project.

Data Collection Method: Web scraping of publicly available Airbnb listing information

Data Currency: Most recent quarterly snapshot available (2025)

# Load and examine the dataset
listings_raw <- read_csv("listings.csv")
## Rows: 18187 Columns: 79
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (25): listing_url, source, name, description, neighborhood_overview, pi...
## dbl  (42): id, scrape_id, host_id, host_listings_count, host_total_listings_...
## lgl   (7): host_is_superhost, host_has_profile_pic, host_identity_verified, ...
## date  (5): last_scraped, host_since, calendar_last_scraped, first_review, la...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Replace all "N/A" values with blank in character columns
char_cols <- names(listings_raw)[sapply(listings_raw, is.character)]
for(col in char_cols) {
  listings_raw[[col]][listings_raw[[col]] == "N/A"] <- ""
}

# Define variables
dims <- paste(dim(listings_raw), collapse = " x ")
nvars <- ncol(listings_raw)


# Print all in one cat
cat(" Full dataset dimensions:", dims, "\n", "Total variables available:", nvars, "\n")
##  Full dataset dimensions: 18187 x 79 
##  Total variables available: 79

3.2 Feature Selection Strategy

Given the comprehensive nature of the Inside Airbnb dataset (18187 listings x 79 features), we employ a strategic feature selection approach focusing on variables most relevant to pricing classification.

# FEATURE SELECTION: Selecting the most relevant variables for classification
selected_features <- c(
  "id", "price", "property_type", "room_type", "accommodates",
  "bedrooms", "bathrooms", "amenities", "neighbourhood_cleansed",
  "latitude", "longitude", "host_is_superhost", "host_response_rate",
  "host_listings_count", "host_identity_verified", "review_scores_rating",
  "number_of_reviews", "reviews_per_month", "availability_365",
  "minimum_nights"
)

3.3 Dataset Overview

# Basic dataset information
# Convert "t"/"f" flag columns to logical (NAs are allowed so partially missing columns are included)
listings_raw <- listings_raw %>%
  mutate(across(where(~ all(. %in% c("t", "f", NA))), ~ . == "t"))
# Selected only the chosen features
listings <- listings_raw %>%
  select(all_of(selected_features))


# Variable types
numeric_vars <- listings %>% select_if(is.numeric) %>% names()
character_vars <- listings %>% select_if(is.character) %>% names()
boolean_vars <- listings %>% select_if(is.logical) %>% names()



cat(" DATASET SUMMARY:\n", "Number of observations:", nrow(listings), ", Number of variables:", ncol(listings), "\n", "\n VARIABLE TYPES:\n", "Numeric variables (", length(numeric_vars), "):", paste(numeric_vars, collapse = ", "), "\n","Character variables (", length(character_vars), "):", paste(character_vars, collapse = ", "), "\n", "Boolean variables (", length(boolean_vars), "):", paste(boolean_vars, collapse = ", "), "\n")
##  DATASET SUMMARY:
##  Number of observations: 18187 , Number of variables: 20 
##  
##  VARIABLE TYPES:
##  Numeric variables ( 12 ): id, accommodates, bedrooms, bathrooms, latitude, longitude, host_listings_count, review_scores_rating, number_of_reviews, reviews_per_month, availability_365, minimum_nights 
##  Character variables ( 6 ): price, property_type, room_type, amenities, neighbourhood_cleansed, host_response_rate 
##  Boolean variables ( 2 ): host_is_superhost, host_identity_verified

3.4 Target Variable Creation and Categorical Preprocessing

To simplify the classification process, the continuous price variable was transformed into a categorical outcome representing distinct market segments. Raw price values, originally stored as character strings with currency symbols and commas, were first cleaned and converted to numeric format. Extreme outliers (nightly rates above $1,000) were later excluded to reduce noise and improve model stability.

Additionally, to prevent issues with rare categorical levels appearing only in test data, we preprocess high-cardinality categorical variables by grouping rare categories into an “Other” category.

# Creating target variable based on price thresholds

# Cleaning price data
listings$price_numeric <- as.numeric(gsub("[$,]", "", listings$price))

# Creating price categories
listings$price_category <- cut(
  listings$price_numeric,
  breaks = c(0, 100, 200, Inf),
  labels = c("Budget", "MidMarket", "Premium"),
  include.lowest = TRUE
)

# Summaries
price_summary <- summary(listings$price_numeric)
target_dist <- table(listings$price_category)
target_props <- prop.table(target_dist) * 100


cat(
  "CREATING TARGET VARIABLE:\n\n",
  "PRICE SUMMARY (in $ per night):\n",
  sprintf("Min       : %.2f\n", price_summary["Min."]),
  sprintf("1st Qu.   : %.2f\n", price_summary["1st Qu."]),
  sprintf("Median    : %.2f\n", price_summary["Median"]),
  sprintf("Mean      : %.2f\n", price_summary["Mean"]),
  sprintf("3rd Qu.   : %.2f\n", price_summary["3rd Qu."]),
  sprintf("Max       : %.2f\n\n", price_summary["Max."]),
  "TARGET VARIABLE DEFINITION:\n",
  "- Budget      : $0-100/night (Budget-conscious travelers)\n",
  "- MidMarket   : $100-200/night (Mainstream market)\n",
  "- Premium     : >$200/night (Luxury segment)\n\n",
  "TARGET VARIABLE DISTRIBUTION:\n",
  sprintf("Budget      : %d (%.2f%%)\n", target_dist["Budget"], target_props["Budget"]),
  sprintf("MidMarket   : %d (%.2f%%)\n", target_dist["MidMarket"], target_props["MidMarket"]),
  sprintf("Premium     : %d (%.2f%%)\n", target_dist["Premium"], target_props["Premium"]),
  sprintf("NA values   : %d \n", sum(is.na(listings$price_category))),
  sep = ""
)
## CREATING TARGET VARIABLE:
## 
## PRICE SUMMARY (in $ per night):
## Min       : 17.00
## 1st Qu.   : 139.00
## Median    : 206.00
## Mean      : 339.47
## 3rd Qu.   : 329.00
## Max       : 20000.00
## 
## TARGET VARIABLE DEFINITION:
## - Budget      : $0-100/night (Budget-conscious travelers)
## - MidMarket   : $100-200/night (Mainstream market)
## - Premium     : >$200/night (Luxury segment)
## 
## TARGET VARIABLE DISTRIBUTION:
## Budget      : 2181 (13.86%)
## MidMarket   : 5433 (34.53%)
## Premium     : 8120 (51.61%)
## NA values   : 2453
# Bar Plot for Price vs Number of Properties
ggplot(listings, aes(x = price_category, fill = price_category)) +
  geom_bar() +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Distribution of Sydney Airbnb Price Categories",
       subtitle = "Classification Target Variable",
       x = "Price Category", y = "Number of Properties") +
  theme_minimal() +
  scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
  theme(legend.position = "none")

4 Data Cleaning and Preparation

The raw dataset required extensive cleaning and preprocessing to ensure reliability for analysis and classification modeling. Currency symbols and thousands separators were stripped from the price field before converting it to numeric, and extreme outliers were removed. Character columns containing literal “N/A” strings were standardized, and remaining missing values were imputed using the column median for numeric features, or sensible minimum defaults (0, 1, FALSE) for counts and boolean flags. The cleaned dataset provided a complete and consistent foundation with a refined set of features suitable for exploratory analysis and predictive modeling (Michelucci, 2025).

4.1 Data Preparation

output_str <- ""

# 1. Missing values analysis
missing_summary <- listings %>%
  summarise_all(~sum(is.na(.))) %>%
  gather(variable, missing_count) %>%
  mutate(missing_percent = round(missing_count / nrow(listings) * 100, 2)) %>%
  filter(missing_count > 0) %>%
  arrange(desc(missing_percent))

if(nrow(missing_summary) > 0) {
  output_str <- paste0(output_str, "1. Missing Values Detected:\n")
  output_str <- paste0(output_str, "Number of missing columns: ", nrow(missing_summary), "\n")
  
  # Create formatted strings for each variable
  missing_strings <- paste0(missing_summary$variable, ": ", 
                           missing_summary$missing_count, " (", 
                           missing_summary$missing_percent, "%)")
  
  # Join all with | separator
  output_str <- paste0(output_str, paste(missing_strings, collapse = " | "), "\n")
} else {
  output_str <- paste0(output_str, "1. Missing Values: No missing values detected in selected features\n")
}

# 2. Price outliers
price_outliers <- listings %>%
  filter(price_numeric > quantile(price_numeric, 0.99, na.rm = TRUE) | 
         price_numeric < quantile(price_numeric, 0.01, na.rm = TRUE)) %>%
  nrow()
output_str <- paste0(output_str, "\n2. Price Outliers:\n")
output_str <- paste0(output_str, "Potential price outliers (beyond 1st/99th percentile): ", price_outliers, "\n")

# 3. Categorical variable complexity
output_str <- paste0(output_str, "\n3. High-Dimensional Categorical Data:\n")
output_str <- paste0(output_str, "Number of unique neighbourhoods: ", length(unique(listings$neighbourhood_cleansed)), "\n")
output_str <- paste0(output_str, "Number of unique property types: ", length(unique(listings$property_type)), "\n")

# 4. Class imbalance
min_class_prop <- min(prop.table(table(listings$price_category)))
max_class_prop <- max(prop.table(table(listings$price_category)))
imbalance_ratio <- max_class_prop / min_class_prop
output_str <- paste0(output_str, "\n4. Class Imbalance Analysis:\n")
output_str <- paste0(output_str, "Class imbalance ratio: ", round(imbalance_ratio, 2), ":1\n")

# 5. Additional challenges that can be considered
output_str <- paste0(output_str, "\n5. Additional Challenges to Consider:\n")
output_str <- paste0(output_str, "- Geographic clustering effects in Sydney neighborhoods\n")
output_str <- paste0(output_str, "- Seasonal pricing variations not captured in snapshot data\n")
output_str <- paste0(output_str, "- Text processing requirements for amenities field\n")
output_str <- paste0(output_str, "- Potential correlation between location and property characteristics\n")

cat(output_str)
## 1. Missing Values Detected:
## Number of missing columns: 11
## review_scores_rating: 3179 (17.48%) | reviews_per_month: 3179 (17.48%) | bathrooms: 2458 (13.52%) | price: 2453 (13.49%) | price_numeric: 2453 (13.49%) | price_category: 2453 (13.49%) | host_is_superhost: 556 (3.06%) | bedrooms: 436 (2.4%) | host_response_rate: 5 (0.03%) | host_listings_count: 5 (0.03%) | host_identity_verified: 5 (0.03%)
## 
## 2. Price Outliers:
## Potential price outliers (beyond 1st/99th percentile): 301
## 
## 3. High-Dimensional Categorical Data:
## Number of unique neighbourhoods: 38
## Number of unique property types: 69
## 
## 4. Class Imbalance Analysis:
## Class imbalance ratio: 3.72:1
## 
## 5. Additional Challenges to Consider:
## - Geographic clustering effects in Sydney neighborhoods
## - Seasonal pricing variations not captured in snapshot data
## - Text processing requirements for amenities field
## - Potential correlation between location and property characteristics
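The class imbalance quantified above can be counteracted at modeling time with inverse-frequency class weights. A minimal sketch using the target counts reported in Section 3.4 (the weighting step is illustrative and not part of this report's pipeline):

```r
# Inverse-frequency class weights (illustrative; counts taken from Section 3.4)
class_counts  <- c(Budget = 2181, MidMarket = 5433, Premium = 8120)
class_weights <- sum(class_counts) / (length(class_counts) * class_counts)

round(max(class_counts) / min(class_counts), 2)  # reproduces the 3.72:1 ratio above
round(class_weights, 2)  # the minority Budget class receives the largest weight
```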

4.2 Data Cleaning

output_text <- ""

# Initial data dimensions
initial_dim <- dim(listings)
output_text <- paste0(output_text, "Initial data dimensions: ", initial_dim[1], " rows x ", initial_dim[2], " columns\n\n")

# 1. Handling price data
listings$price_numeric <- as.numeric(gsub("[$,]", "", listings$price))
outlier_threshold <- 1000
initial_count <- nrow(listings)
listings <- listings %>% filter(price_numeric > 0 & price_numeric <= outlier_threshold)
removed_outliers <- initial_count - nrow(listings)
output_text <- paste0(output_text, "Removed ", removed_outliers, " extreme price outliers (>$", outlier_threshold, ")\n")
output_text <- paste0(output_text, "Remaining observations: ", nrow(listings), "\n\n")

# 2. Clean host_is_superhost (read_csv already parses this column as logical;
#    comparing a logical column against "t" would coerce everything to FALSE,
#    so convert only if the column is still character)
if (is.character(listings$host_is_superhost)) {
  listings$host_is_superhost <- listings$host_is_superhost == "t"
}

# 3. Handle host_response_rate
if("host_response_rate" %in% names(listings)) {
  listings$host_response_rate <- as.numeric(gsub("%", "", listings$host_response_rate)) / 100
}

# 4. Process amenities
if("amenities" %in% names(listings)) {
  listings$amenities_count <- ifelse(
    is.na(listings$amenities) | listings$amenities == "" | listings$amenities == "[]",
    0,
    str_count(listings$amenities, '",') + 1
  )
} else {
  listings$amenities_count <- 0
}

# 5. Handle missing values
missing_summary <- listings %>%
  summarise_all(~sum(is.na(.))) %>%
  gather(variable, missing_count) %>%
  mutate(missing_percent = round(missing_count / nrow(listings) * 100, 2)) %>%
  filter(missing_count > 0) %>%
  arrange(desc(missing_percent))

if(nrow(missing_summary) > 0) {
  for(i in 1:nrow(missing_summary)) {
    output_text <- paste0(output_text, sprintf("- %s: %d missing (%.2f%%)\n",
                                               missing_summary$variable[i],
                                               missing_summary$missing_count[i],
                                               missing_summary$missing_percent[i]))
  }
  output_text <- paste0(output_text, "\n")
  
  # Imputation
  if("reviews_per_month" %in% missing_summary$variable) {
    listings$reviews_per_month[is.na(listings$reviews_per_month)] <- 0
  }
  if("host_is_superhost" %in% missing_summary$variable) {
    listings$host_is_superhost[is.na(listings$host_is_superhost)] <- FALSE
  }
  if("bathrooms" %in% missing_summary$variable) {
    median_bathrooms <- median(listings$bathrooms, na.rm = TRUE)
    listings$bathrooms[is.na(listings$bathrooms)] <- median_bathrooms
  }
  if("host_listings_count" %in% missing_summary$variable) {
    listings$host_listings_count[is.na(listings$host_listings_count)] <- 1
  }
  if("host_identity_verified" %in% missing_summary$variable) {
    listings$host_identity_verified[is.na(listings$host_identity_verified)] <- FALSE
  }
  if("bedrooms" %in% missing_summary$variable) {
    listings$bedrooms[is.na(listings$bedrooms)] <- ceiling(listings$accommodates[is.na(listings$bedrooms)] / 2)
  }
  if("review_scores_rating" %in% missing_summary$variable) {
    median_rating <- median(listings$review_scores_rating, na.rm = TRUE)
    listings$review_scores_rating[is.na(listings$review_scores_rating)] <- median_rating
  }
  if("host_response_rate" %in% missing_summary$variable) {
    median_response_rate <- median(listings$host_response_rate, na.rm = TRUE)
    listings$host_response_rate[is.na(listings$host_response_rate)] <- median_response_rate
  }
} else {
  output_text <- paste0(output_text, "5. Handling missing values... No missing values detected after initial cleaning\n")
}

# Verify missing values
missing_after <- listings %>%
  summarise_all(~sum(is.na(.))) %>%
  gather(variable, missing_count) %>%
  filter(missing_count > 0)

if(nrow(missing_after) > 0) {
  output_text <- paste0(output_text, "\nVERIFYING IMPUTATION RESULTS:\n  Still have missing values in:\n")
  for(i in 1:nrow(missing_after)) {
    output_text <- paste0(output_text, sprintf("- %s: %d missing\n",
                                               missing_after$variable[i],
                                               missing_after$missing_count[i]))
  }
} else {
  output_text <- paste0(output_text, "\nVERIFYING IMPUTATION RESULTS:\n All missing values successfully handled!\n")
}

output_text <- paste0(output_text, " Missing data imputation strategy completed.\n\n")

# 6. Feature engineering
listings <- listings %>%
  mutate(
    is_popular_area = neighbourhood_cleansed %in% c("Bondi", "Sydney", "Manly", "Darlinghurst", "Surry Hills"),
    distance_from_cbd = sqrt((latitude - (-33.8688))^2 + (longitude - 151.2093)^2),  # Euclidean distance in coordinate degrees
    property_size = case_when(
      accommodates <= 2 ~ "Small",
      accommodates <= 4 ~ "Medium",
      accommodates <= 8 ~ "Large",
      TRUE ~ "Extra Large"
    ),
    host_experience = case_when(
      host_listings_count == 1 ~ "Single Property",
      host_listings_count <= 5 ~ "Small Portfolio",
      TRUE ~ "Large Portfolio"
    ),
    availability_level = case_when(
      availability_365 < 90 ~ "Low",
      availability_365 < 180 ~ "Medium",
      TRUE ~ "High"
    )
  )

# 7. Remove duplicates
initial_rows <- nrow(listings)
listings <- listings %>% distinct()
duplicates_removed <- initial_rows - nrow(listings)

# Recreate target variable
listings$price_category <- cut(listings$price_numeric,
                               breaks = c(0, 100, 200, Inf),
                               labels = c("Budget", "MidMarket", "Premium"),
                               include.lowest = TRUE)

# Final dataset summary
final_dim <- dim(listings)
final_target_dist <- table(listings$price_category)
final_target_prop <- round(prop.table(final_target_dist), 3)

output_text <- paste0(output_text, "Final Cleaned Dataset:\nDimensions: ", final_dim[1], " rows x ", final_dim[2], " columns\n")
output_text <- paste0(output_text, "Complete cases: ", sum(complete.cases(listings)), "\n\n")
output_text <- paste0(output_text, "Final target distribution:\n")
for(level in names(final_target_dist)) {
  output_text <- paste0(output_text, sprintf("%-10s : %d (%.3f)\n", level, final_target_dist[level], final_target_prop[level]))
}


cat(output_text)
## Initial data dimensions: 18187 rows x 22 columns
## 
## Removed 3180 extreme price outliers (>$1000)
## Remaining observations: 15007
## 
## - host_response_rate: 2541 missing (16.93%)
## - review_scores_rating: 2375 missing (15.83%)
## - reviews_per_month: 2375 missing (15.83%)
## - host_is_superhost: 486 missing (3.24%)
## - bedrooms: 18 missing (0.12%)
## - bathrooms: 5 missing (0.03%)
## - host_listings_count: 2 missing (0.01%)
## - host_identity_verified: 2 missing (0.01%)
## 
## 
## VERIFYING IMPUTATION RESULTS:
##  All missing values successfully handled!
##  Missing data imputation strategy completed.
## 
## Final Cleaned Dataset:
## Dimensions: 15007 rows x 28 columns
## Complete cases: 15007
## 
## Final target distribution:
## Budget     : 2181 (0.145)
## MidMarket  : 5433 (0.362)
## Premium    : 7393 (0.493)
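The `distance_from_cbd` feature engineered above is a Euclidean distance in raw coordinate degrees, which is adequate for ranking listings but not directly interpretable. If a distance in kilometres were wanted, a haversine conversion of the same latitude/longitude columns could be used; a sketch (the `haversine_km` helper and the Bondi Beach coordinates are illustrative assumptions, not part of the original pipeline):

```r
# Haversine great-circle distance (km) from the Sydney CBD reference point
haversine_km <- function(lat, lon, lat0 = -33.8688, lon0 = 151.2093) {
  r <- 6371                  # mean Earth radius in km
  to_rad <- pi / 180
  dlat <- (lat - lat0) * to_rad
  dlon <- (lon - lon0) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat * to_rad) * cos(lat0 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(pmin(1, a)))
}

# e.g. listings <- listings %>% mutate(distance_cbd_km = haversine_km(latitude, longitude))
haversine_km(-33.8908, 151.2743)  # approx. Bondi Beach, a few km east of the CBD
```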

4.3 Handling Rare Categorical Levels

To prevent modeling errors from rare categories appearing only in train or test sets, we group infrequent levels into an “Other” category.

# Function to collapse rare categories into "Other"
collapse_rare_levels <- function(data, column, min_freq = 30) {
  freq_table <- table(data[[column]])
  rare_levels <- names(freq_table[freq_table < min_freq])

  if (length(rare_levels) > 0) {
    data[[column]] <- as.character(data[[column]])
    data[[column]][data[[column]] %in% rare_levels] <- "Other"
    data[[column]] <- as.factor(data[[column]])
  }
  return(data)
}

# Apply to high-cardinality categorical variables
listings <- collapse_rare_levels(listings, "property_type", min_freq = 30)
listings <- collapse_rare_levels(listings, "neighbourhood_cleansed", min_freq = 20)

cat("After collapsing rare categories:\n")
## After collapsing rare categories:
cat("Unique property types:", length(unique(listings$property_type)), "\n")
## Unique property types: 25
cat("Unique neighbourhoods:", length(unique(listings$neighbourhood_cleansed)), "\n")
## Unique neighbourhoods: 38

5 Exploratory Data Analysis

Exploratory data analysis was conducted to uncover key patterns and relationships within the Sydney Airbnb market (Inside Airbnb, 2025; Cox, 2024). The distribution of nightly prices reinforced the decision to classify listings into Budget, MidMarket, and Premium market segments. Room type emerged as a major determinant of price, with entire homes and apartments commanding higher rates than shared or private rooms (Australian Bureau of Statistics, 2023; NSW Government, 2024). Additional analyses showed that listings with more reviews and greater availability tended to cluster in the Budget and MidMarket categories, whereas Premium properties were less frequent but typically associated with high-demand tourist areas such as the Sydney CBD.

5.1 Target Variable Distribution Analysis

# Target distribution
p1 <- ggplot(listings, aes(x = price_category, fill = price_category)) +
  geom_bar() +
  geom_text(stat = 'count', aes(label = paste0(after_stat(count), "\n(",
    round(after_stat(count) / sum(after_stat(count)) * 100, 1), "%)")), vjust = -0.5) +
  labs(title = "Distribution of Price Categories",
       subtitle = "Classification target variable",
       x = "Price Category", y = "Count") +
  theme_minimal() +
  scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
  theme(legend.position = "none")

# Accommodates by category
p2 <- ggplot(listings, aes(x = accommodates, fill = price_category)) +
  geom_histogram(bins = 15, position = "dodge", alpha = 0.7) +
  labs(title = "Guest Capacity Distribution by Price Category",
       x = "Number of Guests Accommodated", y = "Count") +
  theme_minimal() +
  scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
  facet_wrap(~price_category, ncol = 1, scales = "free_y")

# Bedrooms by category
p3 <- ggplot(listings, aes(x = bedrooms, fill = price_category)) +
  geom_histogram(bins = 10, position = "dodge", alpha = 0.7) +
  labs(title = "Bedroom Distribution by Price Category",
       x = "Number of Bedrooms", y = "Count") +
  theme_minimal() +
  scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
  facet_wrap(~price_category, ncol = 1, scales = "free_y")

# Distance from CBD by category boxplot
p4 <- ggplot(listings, aes(x = price_category, y = distance_from_cbd, fill = price_category)) +
  geom_boxplot() +
  labs(title = "Distance from CBD by Price Category",
       subtitle = "Premium properties tend to be closer to city center",
       x = "Price Category", y = "Distance from CBD (degrees)") +
  theme_minimal() +
  scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
  theme(legend.position = "none")

# Histogram of actual price distribution within each category
p5 <- ggplot(listings, aes(x = price_numeric, fill = price_category)) +
  geom_histogram(bins = 30, alpha = 0.7) +
  labs(title = "Price Distribution Within Each Category",
       subtitle = "Examining the spread of actual prices within Budget, MidMarket, and Premium tiers",
       x = "Nightly Price (AUD)", y = "Count") +
  theme_minimal() +
  scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
  facet_wrap(~price_category, ncol = 1, scales = "free") +
  scale_x_continuous(labels = dollar_format(prefix = "$"))

# Arrange plots
grid.arrange(p1, p4, ncol=1)

grid.arrange(p2, p3, ncol=2)

grid.arrange(p5, ncol=1)

5.2 Property Characteristics Analysis

# Property type analysis
p3 <- listings %>%
  count(property_type, price_category) %>%
  group_by(property_type) %>%
  filter(sum(n) >= 50) %>%     # keep property types with >= 50 listings
  ungroup() %>%
  ggplot(aes(x = reorder(property_type, n), y = n, fill = price_category)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  labs(title = "Property Types by Price Category",
       subtitle = "Only property types with 50+ listings shown",
       x = "Property Type", y = "Count") +
  theme_minimal() +
  scale_fill_manual(values = c("Budget" = "#2ecc71",
                               "MidMarket" = "#f39c12",
                               "Premium" = "#e74c3c"))

# Room type analysis
p4 <- listings %>%
  ggplot(aes(x = room_type, fill = price_category)) +
  geom_bar(position = "fill") +
  labs(title = "Room Type Composition by Price Category",
       subtitle = "Proportion of each price category within room types",
       x = "Room Type", y = "Proportion") +
  theme_minimal() +
  scale_fill_manual(values = c("Budget" = "#2ecc71",
                               "MidMarket" = "#f39c12",
                               "Premium" = "#e74c3c")) +
  scale_y_continuous(labels = scales::percent_format()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Arrange both plots
gridExtra::grid.arrange(p3, p4, ncol = 2)

5.3 Feature Comparison Across Price Categories

# Statistical summary of numeric features by price category
numeric_summary <- listings %>%
  dplyr::select(price_category, accommodates, bedrooms, bathrooms, host_listings_count,
         number_of_reviews, review_scores_rating, availability_365,
         distance_from_cbd, amenities_count, minimum_nights) %>%
  group_by(price_category) %>%
  summarise(
    mean_accommodates = mean(accommodates, na.rm = TRUE),
    mean_bedrooms = mean(bedrooms, na.rm = TRUE),
    mean_bathrooms = mean(bathrooms, na.rm = TRUE),
    mean_amenities = mean(amenities_count, na.rm = TRUE),
    mean_reviews = mean(number_of_reviews, na.rm = TRUE),
    mean_rating = mean(review_scores_rating, na.rm = TRUE),
    mean_distance_cbd = mean(distance_from_cbd, na.rm = TRUE),
    mean_availability = mean(availability_365, na.rm = TRUE)
  )

print(kable(numeric_summary, digits = 2,
      caption = "Mean Feature Values by Price Category"))
## 
## 
## Table: Mean Feature Values by Price Category
## 
## |price_category | mean_accommodates| mean_bedrooms| mean_bathrooms| mean_amenities| mean_reviews| mean_rating| mean_distance_cbd| mean_availability|
## |:--------------|-----------------:|-------------:|--------------:|--------------:|------------:|-----------:|-----------------:|-----------------:|
## |Budget         |              1.85|          1.06|           1.22|          29.36|        34.01|        4.61|              0.15|            220.45|
## |MidMarket      |              3.04|          1.23|           1.16|          35.53|        55.43|        4.73|              0.10|            178.02|
## |Premium        |              4.92|          2.26|           1.62|          40.40|        32.65|        4.77|              0.10|            194.45|
# Boxplots comparing key numeric features across categories
p1 <- ggplot(listings, aes(x = price_category, y = accommodates, fill = price_category)) +
  geom_boxplot() +
  labs(title = "Guest Capacity by Category", x = "", y = "Accommodates") +
  theme_minimal() +
  scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
  theme(legend.position = "none")

p2 <- ggplot(listings, aes(x = price_category, y = bedrooms, fill = price_category)) +
  geom_boxplot() +
  labs(title = "Bedrooms by Category", x = "", y = "Bedrooms") +
  theme_minimal() +
  scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
  theme(legend.position = "none")

p3 <- ggplot(listings, aes(x = price_category, y = amenities_count, fill = price_category)) +
  geom_boxplot() +
  labs(title = "Amenities by Category", x = "", y = "Amenity Count") +
  theme_minimal() +
  scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
  theme(legend.position = "none")

p4 <- ggplot(listings, aes(x = price_category, y = review_scores_rating, fill = price_category)) +
  geom_boxplot() +
  labs(title = "Review Scores by Category", x = "", y = "Rating") +
  theme_minimal() +
  scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
  theme(legend.position = "none")

p5 <- ggplot(listings, aes(x = price_category, y = availability_365, fill = price_category)) +
  geom_boxplot() +
  labs(title = "Availability by Category", x = "Price Category", y = "Days Available") +
  theme_minimal() +
  scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
  theme(legend.position = "none")

p6 <- ggplot(listings, aes(x = price_category, y = number_of_reviews, fill = price_category)) +
  geom_boxplot() +
  labs(title = "Review Count by Category", x = "Price Category", y = "Number of Reviews") +
  theme_minimal() +
  scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
  theme(legend.position = "none")

# Arrange plots
grid.arrange(p1, p2, p3, p4, p5, p6, ncol=3)

# Categorical feature distributions by price category
cat("\n\nHost Superhost Status by Price Category:\n")
## 
## 
## Host Superhost Status by Price Category:
print(prop.table(table(listings$price_category, listings$host_is_superhost), margin = 1))
##            
##             FALSE
##   Budget        1
##   MidMarket     1
##   Premium       1
cat("\n\nRoom Type Distribution by Price Category:\n")
## 
## 
## Room Type Distribution by Price Category:
print(prop.table(table(listings$price_category, listings$room_type), margin = 1))
##            
##             Entire home/apt   Hotel room Private room  Shared room
##   Budget       0.1357175608 0.0032095369 0.8486932600 0.0123796424
##   MidMarket    0.8210933186 0.0064421130 0.1717283269 0.0007362415
##   Premium      0.9500879210 0.0048694711 0.0449073448 0.0001352631

5.4 Export Cleaned Dataset

After completing the exploratory data analysis, we export the cleaned and processed dataset for potential future use.

# Export the cleaned dataset with all engineered features
output_file <- "listings_cleaned_with_features.csv"
write_csv(listings, output_file)

cat("Cleaned dataset exported successfully!\n")
## Cleaned dataset exported successfully!
cat("File:", output_file, "\n")
## File: listings_cleaned_with_features.csv
cat("Location:", getwd(), "\n")
## Location: /Users/ABRAHAM/Documents/USYD/Sem 2/Computational Statistical Methods- STAT5003/Assignment2
cat("Dimensions:", nrow(listings), "rows x", ncol(listings), "columns\n")
## Dimensions: 15007 rows x 28 columns
cat("\nThis dataset includes:\n")
## 
## This dataset includes:
cat("- Original features after cleaning and imputation\n")
## - Original features after cleaning and imputation
cat("- Target variable: price_category (Budget, MidMarket, Premium)\n")
## - Target variable: price_category (Budget, MidMarket, Premium)
cat("- Engineered features: amenities_count, distance_from_cbd, is_popular_area,\n")
## - Engineered features: amenities_count, distance_from_cbd, is_popular_area,
cat("  property_size, host_experience, availability_level\n")
##   property_size, host_experience, availability_level

6 Modeling Plan

The modelling phase focuses on predicting Airbnb price categories using a classification approach (Inside Airbnb, 2025; Cox, 2024). To ensure robust results, we selected five machine learning algorithms covered in the course. The dataset is split into training and test sets, with cross-validation applied during training to minimize overfitting and improve generalization (Dhummad, 2025; Katyal, Sharma, & Kannan, 2025). Model performance is assessed using multiple evaluation metrics: accuracy for overall correctness, precision and recall for per-class performance, and macro- or weighted-averaged F1-scores to account for potential class imbalance across the three price tiers. This modeling plan balances interpretability with predictive accuracy, providing both actionable insights and reliable classification outcomes.

6.1 Selected Classification Models

We implement five classification algorithms to predict Sydney Airbnb price categories, prioritizing methods taught in STAT5003.

| Model | Purpose | Strengths | Use Case | Rationale for Dataset |
|:--|:--|:--|:--|:--|
| Multinomial Logistic Regression | Baseline interpretable model | Interpretable, fast, probability outputs | Linear relationships | Provides transparent baseline for feature contributions |
| Random Forest | Ensemble method | Handles mixed data, resistant to overfitting, feature importance | Captures non-linear relationships | Handles categorical & numerical features, identifies key drivers |
| Support Vector Machine | High-dimensional classification | Robust to outliers, flexible boundaries | Complex decision boundaries | Separates overlapping price categories using kernels |
| Linear Discriminant Analysis | Dimensionality reduction | Simple, interpretable, efficient | Maximize class separation | Reduces redundancy in correlated features |
| K Nearest Neighbors | Non-parametric, instance-based | Simple, local pattern recognition | Geographic/neighborhood patterns | Leverages localized pricing similarity |

6.2 Model Implementation Strategy

  1. Baseline Models (Logistic Regression, LDA):
    • Establish performance benchmark
    • Identify most important linear predictors
    • Provide interpretable coefficients
  2. Tree-based Model (Random Forest):
    • Capture non-linear relationships
    • Handle feature interactions automatically
    • Provide feature importance rankings
  3. Distance-based Model (KNN):
    • Leverage geographic clustering
    • Capture local neighborhood effects
    • Non-parametric approach
  4. Kernel Method (SVM):
    • Complex decision boundaries
    • Robust to outliers
    • High-dimensional feature space
  5. Model Comparison:
    • Statistical significance testing
    • Computational efficiency analysis
    • Error pattern analysis
    • Business interpretation of results
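The comparison stage (step 5) can be sketched with caret's resamples() interface, which collects cross-validation results from models fitted under a shared resampling scheme. The sketch below is illustrative only: it uses the built-in iris data as a stand-in for our listings, and a shorter method list; the shared fold indices are what make the later significance testing valid.

```r
library(caret)  # train(), trainControl(), resamples()

# Illustrative comparison loop on stand-in data (iris), not the report's
# own run: fit several caret methods on identical CV folds, then compare
# their resampled accuracy/kappa distributions side by side.
set.seed(123)
idx <- createMultiFolds(iris$Species, k = 5, times = 3)  # shared fold indices
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3, index = idx)

methods <- c("lda", "rf", "knn")
fits <- setNames(
  lapply(methods, function(m)
    train(Species ~ ., data = iris, method = m, trControl = ctrl)),
  methods
)

summary(resamples(fits))  # CV accuracy and kappa distributions per model
```

Fixing the fold indices up front (via `index = idx`) ensures every model sees exactly the same resamples, so differences in the summary reflect the models rather than the splits.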

6.3 Model Evaluation Framework

6.3.1 Data Splitting Strategy

The Sydney Airbnb dataset is split into 70% training data and 30% test data. Holding out a test set ensures that our classification models learn patterns effectively and generalize well to new, unseen data.

library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following objects are masked from 'package:MLmetrics':
## 
##     MAE, RMSE
## The following object is masked from 'package:purrr':
## 
##     lift
# Stratified train-test split ratio (70:30)
set.seed(123)
train_indices <- createDataPartition(listings$price_category, p = 0.7, list = FALSE)
train_data <- listings[train_indices, ]
test_data <- listings[-train_indices, ]

# Compute sizes and percentages
train_size <- nrow(train_data)
test_size <- nrow(test_data)
train_pct <- round(train_size / nrow(listings) * 100, 1)
test_pct <- round(test_size / nrow(listings) * 100, 1)

# Class distributions
train_dist <- round(prop.table(table(train_data$price_category)), 3)
test_dist <- round(prop.table(table(test_data$price_category)), 3)

output_text <- paste0(
  "Data Splitting Summary:\n",
  "Training set size: ", train_size, " (", train_pct, "%)\n",
  "Test set size    : ", test_size, " (", test_pct, "%)\n\n",
  "Class distribution in training set:\n",
  paste(names(train_dist), ":", train_dist, collapse = "\n"), "\n\n",
  "Class distribution in test set:\n",
  paste(names(test_dist), ":", test_dist, collapse = "\n"), "\n"
)

cat(output_text)
## Data Splitting Summary:
## Training set size: 10507 (70%)
## Test set size    : 4500 (30%)
## 
## Class distribution in training set:
## Budget : 0.145
## MidMarket : 0.362
## Premium : 0.493
## 
## Class distribution in test set:
## Budget : 0.145
## MidMarket : 0.362
## Premium : 0.493

6.3.2 Cross-Validation Strategy

  • Method: 5-fold cross-validation on training set
  • Repetitions: 3 repetitions for robust estimates
  • Stratification: Maintain class proportions within each fold
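A minimal sketch of this resampling scheme uses caret's createMultiFolds(), which samples within each outcome class so fold-level class proportions stay close to the full training set. The outcome vector below is synthetic, built only to mirror the class shares reported in 6.3.1.

```r
library(caret)  # createMultiFolds() builds stratified resampling indices

# Sketch of the 5-fold x 3-repeat scheme on a synthetic outcome whose
# class shares mirror the training set (~14.5% / 36.2% / 49.3%).
set.seed(123)
y <- factor(rep(c("Budget", "MidMarket", "Premium"),
                times = c(145, 362, 493)))
folds <- createMultiFolds(y, k = 5, times = 3)

length(folds)                     # 15 training-index sets (5 folds x 3 repeats)
prop.table(table(y[folds[[1]]]))  # class shares stay near 0.145 / 0.362 / 0.493
```

These index sets are the same objects trainControl() builds internally when `method = "repeatedcv"` is requested.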

6.3.3 Evaluation Metrics

Our comprehensive evaluation metrics can be classified under the following categories:

1. Overall Performance Metrics

  • Accuracy: Proportion of correctly classified instances
  • Kappa Statistic: Agreement between predicted and actual classifications (accounting for chance)

2. Class-Specific Metrics

  • Precision: True Positives / (True Positives + False Positives)
  • Recall (Sensitivity): True Positives / (True Positives + False Negatives)
  • F1-Score: Harmonic mean of Precision and Recall
  • Specificity: True Negatives / (True Negatives + False Positives)

3. Multi-Class Extensions

  • Macro-averaged metrics: Average metrics across all classes
  • Weighted-averaged metrics: Class-size weighted averages
  • Confusion Matrix: Detailed classification breakdown

4. Advanced Metrics

  • Area Under ROC Curve (AUC): For each class vs. rest
  • Log-Loss: Probabilistic classification error
  • Balanced Accuracy: Average of class-specific accuracies
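To make the averaging concrete, the base-R sketch below computes per-class precision, recall, and F1 from the multinomial logistic regression confusion matrix reported in Section 7.1, then forms the macro (unweighted) and support-weighted averages:

```r
# Per-class F1 and its macro/weighted averages, computed by hand from the
# Section 7.1 logistic regression confusion matrix (rows = predicted,
# cols = actual).
cm <- matrix(c(502,  134,   23,
               131, 1045,  403,
                21,  450, 1791),
             nrow = 3, byrow = TRUE,
             dimnames = list(Predicted = c("Budget", "MidMarket", "Premium"),
                             Actual    = c("Budget", "MidMarket", "Premium")))

precision <- diag(cm) / rowSums(cm)   # TP / (TP + FP)
recall    <- diag(cm) / colSums(cm)   # TP / (TP + FN)
f1        <- 2 * precision * recall / (precision + recall)

macro_f1    <- mean(f1)                          # every class counts equally
weighted_f1 <- sum(f1 * colSums(cm)) / sum(cm)   # weighted by class support

round(c(macro = macro_f1, weighted = weighted_f1), 3)  # macro 0.739, weighted 0.741
```

The weighted average sits slightly above the macro average here because the large Premium class has the highest F1, illustrating why both views are worth reporting under class imbalance.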

6.3.4 Feature Engineering for Models

# Prepare features for modeling
prepare_features <- function(data) {
  model_data <- data %>%
    dplyr::select(
      # Numeric features
      accommodates, bedrooms, bathrooms, host_listings_count,
      number_of_reviews, review_scores_rating, availability_365,
      minimum_nights, distance_from_cbd, amenities_count,

      # Categorical features
      property_type, room_type, neighbourhood_cleansed,
      host_is_superhost, host_identity_verified,
      is_popular_area, property_size, host_experience,
      availability_level,

      # Target variable
      price_category
    ) %>%
    na.omit()  # Remove any remaining missing values
  return(model_data)
}

# Preparing train and test datasets
train_features <- prepare_features(train_data)
test_features <- prepare_features(test_data)

# Get feature names excluding target variable
feature_names <- names(train_features)[names(train_features) != "price_category"]

# Feature type summary
feature_types <- train_features %>%
  dplyr::select(-price_category) %>%
  summarise_all(~ifelse(is.numeric(.), "Numeric", "Categorical")) %>%
  gather(Feature, Type) %>%
  count(Type)

# Format feature summary as text
feature_summary_text <- paste0(feature_types$Type, ": ", feature_types$n, collapse = ", ")

# Format feature names as single line with pipes
feature_names_text <- paste(feature_names, collapse = " | ")

cat(
  "Feature Preparation Summary:\n",
  "Training features shape: ", dim(train_features)[1], " rows x ", dim(train_features)[2], " columns\n",
  "Test features shape    : ", dim(test_features)[1], " rows x ", dim(test_features)[2], " columns\n",
  "Number of features for modeling (excluding target): ", ncol(train_features) - 1, "\n\n",
  "Feature type summary: ", feature_summary_text, "\n\n",
  "The 19 features for modeling:\n",
  feature_names_text, "\n"
)
## Feature Preparation Summary:
##  Training features shape:  10507  rows x  20  columns
##  Test features shape    :  4500  rows x  20  columns
##  Number of features for modeling (excluding target):  19 
## 
##  Feature type summary:  Categorical: 9, Numeric: 10 
## 
##  The 19 features for modeling:
##  accommodates | bedrooms | bathrooms | host_listings_count | number_of_reviews | review_scores_rating | availability_365 | minimum_nights | distance_from_cbd | amenities_count | property_type | room_type | neighbourhood_cleansed | host_is_superhost | host_identity_verified | is_popular_area | property_size | host_experience | availability_level
  • The initial dataset consisted of 20 features, including 12 numeric (such as id, accommodates, bedrooms, bathrooms, latitude, longitude, host_listings_count, review_scores_rating, number_of_reviews, reviews_per_month, availability_365, and minimum_nights), 6 character (price, property_type, room_type, amenities, neighbourhood_cleansed, and host_response_rate), and 2 boolean variables (host_is_superhost and host_identity_verified).

  • From these, 8 additional features were engineered: price_numeric and price_category from price, amenities_count from amenities, is_popular_area from neighbourhood_cleansed, distance_from_cbd from latitude and longitude, host_experience from host_listings_count, property_size from accommodates, and availability_level from availability_365.

  • For machine learning modeling, we finalized 19 predictive features—accommodates, bedrooms, bathrooms, host_listings_count, number_of_reviews, review_scores_rating, availability_365, minimum_nights, distance_from_cbd, amenities_count, property_type, room_type, neighbourhood_cleansed, host_is_superhost, host_identity_verified, is_popular_area, property_size, host_experience, and availability_level—with the target variable defined as price_category (Budget, Mid-Market, Premium).
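As an illustration of the distance_from_cbd engineering mentioned above, a great-circle (haversine) distance can be computed directly from latitude and longitude. The CBD reference coordinates and the kilometre-based formula below are assumptions for illustration; the report's earlier cleaning code may use a different reference point or units.

```r
# Hypothetical sketch of a distance-from-CBD feature: haversine distance
# (in km) from assumed Sydney CBD coordinates (-33.8688, 151.2093).
haversine_km <- function(lat, lon, lat0 = -33.8688, lon0 = 151.2093) {
  to_rad <- pi / 180
  dlat <- (lat - lat0) * to_rad
  dlon <- (lon - lon0) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat * to_rad) * cos(lat0 * to_rad) * sin(dlon / 2)^2
  6371 * 2 * asin(sqrt(pmin(1, a)))  # Earth radius ~6371 km
}

haversine_km(-33.8915, 151.2767)  # Bondi Beach: about 6.7 km from the CBD
```

In a dplyr pipeline this would be applied row-wise, e.g. `mutate(distance_from_cbd = haversine_km(latitude, longitude))`.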

6.3.5 Hyperparameter Tuning Strategy

Each model will undergo systematic hyperparameter optimization to select the best performing parameters:

  1. Random Forest
  • ntree: Number of trees (500, 1000, 1500)
  • mtry: Variables per split (sqrt(p), p/3, p/2)
  • nodesize: Minimum node size (1, 5, 10)
  2. Linear Discriminant Analysis
  • prior: Prior probabilities (equal, proportional to class frequencies, custom)
  • method: Estimation method (moment, mle, mve, t)
  • nu: Degrees of freedom for method = "t" (5, 10, 20)
  • tol: Tolerance for rank deficiency (1e-4, 1e-6, 1e-8)
  3. Support Vector Machine
  • cost: Regularization parameter (0.1, 1, 10, 100)
  • kernel: Kernel type (linear, radial, polynomial)
  • gamma: Kernel coefficient (0.001, 0.01, 0.1, 1)
  4. K Nearest Neighbors
  • k: Number of neighbors (3, 5, 7, 9, 11, 15)
  • Distance metric: Euclidean, Manhattan
  • Scaling: Standardized vs. normalized features
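The SVM cost/gamma grid above can be searched directly with e1071's tune() helper. The sketch below runs that grid on a small synthetic two-feature dataset (not the Airbnb listings); in Section 7.3 the report instead relies on caret's tuneLength shortcut.

```r
library(e1071)  # tune() wraps grid search with cross-validation

# Illustrative grid search on synthetic data: the cost/gamma values match
# the grid listed above, evaluated by 5-fold cross-validation.
set.seed(123)
toy <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
toy$y <- factor(ifelse(toy$x1 + toy$x2 + rnorm(300, sd = 0.5) > 0,
                       "Premium", "Budget"))

tuned <- tune(svm, y ~ ., data = toy,
              ranges = list(cost  = c(0.1, 1, 10, 100),
                            gamma = c(0.001, 0.01, 0.1, 1)),
              tunecontrol = tune.control(cross = 5))

tuned$best.parameters  # cost/gamma pair with the lowest CV error
```

`tuned$performances` holds the CV error for all 16 grid cells, which is useful for checking that the chosen pair is not on the edge of the grid.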

7 Model Implementation and Results

In this section, we implement five classification algorithms and evaluate their performance in predicting Sydney Airbnb price categories. Each model is trained with 5-fold cross-validation, repeated three times, and evaluated on the held-out test set.

# Verify price_category levels are valid R names
cat("Price category levels:", levels(train_features$price_category), "\n")
## Price category levels: Budget MidMarket Premium
cat("Training set dimensions:", nrow(train_features), "x", ncol(train_features), "\n")
## Training set dimensions: 10507 x 20
cat("Test set dimensions:", nrow(test_features), "x", ncol(test_features), "\n")
## Test set dimensions: 4500 x 20
# Ensure factor levels are consistent
train_features$price_category <- factor(train_features$price_category,
                                        levels = c("Budget", "MidMarket", "Premium"))
test_features$price_category <- factor(test_features$price_category,
                                       levels = c("Budget", "MidMarket", "Premium"))

7.1 Model 1: Multinomial Logistic Regression

Multinomial logistic regression serves as our baseline interpretable model, extending binary logistic regression to handle three price categories simultaneously.

library(nnet)
library(caret)

# Set up cross-validation with repeated k-fold
train_control <- trainControl(
  method = "repeatedcv",
  number = 5,          # 5-fold cross-validation
  repeats = 3,         # 3 repetitions for robust estimates
  classProbs = TRUE,
  summaryFunction = multiClassSummary,
  savePredictions = "final",
  verboseIter = FALSE
)

# Train multinomial logistic regression
set.seed(123)
model_logit <- train(
  price_category ~ accommodates + bedrooms + bathrooms + host_listings_count +
    number_of_reviews + review_scores_rating + availability_365 +
    minimum_nights + distance_from_cbd + amenities_count +
    property_type + room_type + neighbourhood_cleansed +
    host_is_superhost + host_identity_verified + is_popular_area +
    property_size + host_experience + availability_level,
  data = train_features,
  method = "multinom",
  trControl = train_control,
  trace = FALSE,
  MaxNWts = 5000
)

# Predictions
logit_pred <- predict(model_logit, test_features)
logit_pred_prob <- predict(model_logit, test_features, type = "prob")

# Confusion Matrix
logit_cm <- confusionMatrix(logit_pred, test_features$price_category)
print(logit_cm)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Budget MidMarket Premium
##   Budget       502       134      23
##   MidMarket    131      1045     403
##   Premium       21       450    1791
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7418          
##                  95% CI : (0.7287, 0.7545)
##     No Information Rate : 0.4927          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5725          
##                                           
##  Mcnemar's Test P-Value : 0.4378          
## 
## Statistics by Class:
## 
##                      Class: Budget Class: MidMarket Class: Premium
## Sensitivity                 0.7676           0.6415         0.8078
## Specificity                 0.9592           0.8140         0.7937
## Pos Pred Value              0.7618           0.6618         0.7918
## Neg Pred Value              0.9604           0.8001         0.8097
## Prevalence                  0.1453           0.3620         0.4927
## Detection Rate              0.1116           0.2322         0.3980
## Detection Prevalence        0.1464           0.3509         0.5027
## Balanced Accuracy           0.8634           0.7277         0.8008
# Store results
logit_accuracy <- logit_cm$overall['Accuracy']
cat("\nLogistic Regression Test Accuracy:", round(logit_accuracy, 4), "\n")
## 
## Logistic Regression Test Accuracy: 0.7418

7.2 Model 2: Random Forest

Random Forest handles non-linear relationships and feature interactions through ensemble learning with decision trees.

library(randomForest)

# Train Random Forest with comprehensive hyperparameter tuning
set.seed(123)
model_rf <- train(
  price_category ~ accommodates + bedrooms + bathrooms + host_listings_count +
    number_of_reviews + review_scores_rating + availability_365 +
    minimum_nights + distance_from_cbd + amenities_count +
    property_type + room_type + neighbourhood_cleansed +
    host_is_superhost + host_identity_verified + is_popular_area +
    property_size + host_experience + availability_level,
  data = train_features,
  method = "rf",
  trControl = train_control,
  ntree = 500,  # 500 trees for stable predictions
  importance = TRUE,
  tuneGrid = data.frame(mtry = c(4, 6, 9))  # sqrt(p) ≈ 4, p/3 ≈ 6, p/2 ≈ 9
)

# Predictions
rf_pred <- predict(model_rf, test_features)
rf_pred_prob <- predict(model_rf, test_features, type = "prob")

# Confusion Matrix
rf_cm <- confusionMatrix(rf_pred, test_features$price_category)
print(rf_cm)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Budget MidMarket Premium
##   Budget       515       116      22
##   MidMarket    126      1135     372
##   Premium       13       378    1823
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7718         
##                  95% CI : (0.7592, 0.784)
##     No Information Rate : 0.4927         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.6229         
##                                          
##  Mcnemar's Test P-Value : 0.4275         
## 
## Statistics by Class:
## 
##                      Class: Budget Class: MidMarket Class: Premium
## Sensitivity                 0.7875           0.6967         0.8223
## Specificity                 0.9641           0.8265         0.8287
## Pos Pred Value              0.7887           0.6950         0.8234
## Neg Pred Value              0.9639           0.8277         0.8276
## Prevalence                  0.1453           0.3620         0.4927
## Detection Rate              0.1144           0.2522         0.4051
## Detection Prevalence        0.1451           0.3629         0.4920
## Balanced Accuracy           0.8758           0.7616         0.8255
# Feature Importance
rf_importance <- varImp(model_rf)
print(plot(rf_importance, top = 15, main = "Top 15 Important Features - Random Forest"))

# Store results
rf_accuracy <- rf_cm$overall['Accuracy']
cat("\nRandom Forest Test Accuracy:", round(rf_accuracy, 4), "\n")
## 
## Random Forest Test Accuracy: 0.7718

7.3 Model 3: Support Vector Machine (SVM)

SVM with radial basis function kernel creates complex decision boundaries in high-dimensional space.

library(e1071)

# Train SVM with RBF kernel and expanded hyperparameter grid
set.seed(123)
model_svm <- train(
  price_category ~ accommodates + bedrooms + bathrooms + host_listings_count +
    number_of_reviews + review_scores_rating + availability_365 +
    minimum_nights + distance_from_cbd + amenities_count +
    property_type + room_type + neighbourhood_cleansed +
    host_is_superhost + host_identity_verified + is_popular_area +
    property_size + host_experience + availability_level,
  data = train_features,
  method = "svmRadial",
  trControl = train_control,
  preProcess = c("center", "scale"),
  tuneLength = 5  # Evaluate 5 cost values (sigma estimated analytically by caret)
)
## line search fails -2.840481 0.04220264 1.036726e-05 6.663514e-06 -5.242233e-08 -1.732632e-08 -6.589302e-13
# Predictions
svm_pred <- predict(model_svm, test_features)
svm_pred_prob <- predict(model_svm, test_features, type = "prob")

# Confusion Matrix
svm_cm <- confusionMatrix(svm_pred, test_features$price_category)
print(svm_cm)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Budget MidMarket Premium
##   Budget       488       142      25
##   MidMarket    147      1056     381
##   Premium       19       431    1811
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7456          
##                  95% CI : (0.7326, 0.7582)
##     No Information Rate : 0.4927          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.5787          
##                                           
##  Mcnemar's Test P-Value : 0.2633          
## 
## Statistics by Class:
## 
##                      Class: Budget Class: MidMarket Class: Premium
## Sensitivity                 0.7462           0.6483         0.8169
## Specificity                 0.9566           0.8161         0.8029
## Pos Pred Value              0.7450           0.6667         0.8010
## Neg Pred Value              0.9568           0.8035         0.8187
## Prevalence                  0.1453           0.3620         0.4927
## Detection Rate              0.1084           0.2347         0.4024
## Detection Prevalence        0.1456           0.3520         0.5024
## Balanced Accuracy           0.8514           0.7322         0.8099
# Store results
svm_accuracy <- svm_cm$overall['Accuracy']
cat("\nSVM Test Accuracy:", round(svm_accuracy, 4), "\n")
## 
## SVM Test Accuracy: 0.7456

7.4 Model 4: Linear Discriminant Analysis (LDA)

LDA finds linear combinations of features that best separate the three price categories. We use only numeric features to avoid collinearity issues with categorical variables.

library(MASS)

# Train LDA with numeric features only (avoiding categorical variables that cause collinearity)
set.seed(123)
tryCatch({
  model_lda <- train(
    price_category ~ accommodates + bedrooms + bathrooms + host_listings_count +
      number_of_reviews + review_scores_rating + availability_365 +
      minimum_nights + distance_from_cbd + amenities_count,
    data = train_features,
    method = "lda",
    trControl = train_control,
    preProcess = c("center", "scale")
  )

  # Predictions
  lda_pred <- predict(model_lda, test_features)
  lda_pred_prob <- predict(model_lda, test_features, type = "prob")

  # Confusion Matrix
  lda_cm <- confusionMatrix(lda_pred, test_features$price_category)
  print(lda_cm)

  # Store results
  lda_accuracy <- lda_cm$overall['Accuracy']
  cat("\nLDA Test Accuracy:", round(lda_accuracy, 4), "\n")
  cat("Note: LDA uses numeric features only to avoid collinearity issues.\n")

}, error = function(e) {
  cat("\nLDA model failed due to collinearity issues. Using Naive Bayes as alternative.\n")
  cat("Error message:", conditionMessage(e), "\n")

  # Use Naive Bayes as a simpler alternative
  model_lda <<- train(
    price_category ~ accommodates + bedrooms + bathrooms + host_listings_count +
      number_of_reviews + review_scores_rating + availability_365 +
      minimum_nights + distance_from_cbd + amenities_count +
      room_type,
    data = train_features,
    method = "naive_bayes",
    trControl = train_control
  )

  lda_pred <<- predict(model_lda, test_features)
  lda_pred_prob <<- predict(model_lda, test_features, type = "prob")
  lda_cm <<- confusionMatrix(lda_pred, test_features$price_category)
  print(lda_cm)
  lda_accuracy <<- lda_cm$overall['Accuracy']
  cat("\nNaive Bayes (Alternative) Test Accuracy:", round(lda_accuracy, 4), "\n")
})
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Budget MidMarket Premium
##   Budget       281       109      60
##   MidMarket    354      1060     508
##   Premium       19       460    1649
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6644          
##                  95% CI : (0.6504, 0.6782)
##     No Information Rate : 0.4927          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4388          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: Budget Class: MidMarket Class: Premium
## Sensitivity                0.42966           0.6507         0.7438
## Specificity                0.95606           0.6998         0.7902
## Pos Pred Value             0.62444           0.5515         0.7749
## Neg Pred Value             0.90790           0.7793         0.7605
## Prevalence                 0.14533           0.3620         0.4927
## Detection Rate             0.06244           0.2356         0.3664
## Detection Prevalence       0.10000           0.4271         0.4729
## Balanced Accuracy          0.69286           0.6752         0.7670
## 
## LDA Test Accuracy: 0.6644 
## Note: LDA uses numeric features only to avoid collinearity issues.

7.5 Model 5: K-Nearest Neighbors (KNN)

KNN classifies properties based on similarity to their nearest neighbors in feature space.

# Train KNN with expanded k-value testing
set.seed(123)
model_knn <- train(
  price_category ~ accommodates + bedrooms + bathrooms + host_listings_count +
    number_of_reviews + review_scores_rating + availability_365 +
    minimum_nights + distance_from_cbd + amenities_count +
    property_type + room_type + neighbourhood_cleansed +
    host_is_superhost + host_identity_verified + is_popular_area +
    property_size + host_experience + availability_level,
  data = train_features,
  method = "knn",
  trControl = train_control,
  preProcess = c("center", "scale"),
  tuneGrid = expand.grid(k = c(3, 5, 7, 9, 11, 15))  # Test 6 different k values
)

# Predictions
knn_pred <- predict(model_knn, test_features)
knn_pred_prob <- predict(model_knn, test_features, type = "prob")

# Confusion Matrix
knn_cm <- confusionMatrix(knn_pred, test_features$price_category)
print(knn_cm)
## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Budget MidMarket Premium
##   Budget       460       148      29
##   MidMarket    168       983     447
##   Premium       26       498    1741
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7076         
##                  95% CI : (0.694, 0.7208)
##     No Information Rate : 0.4927         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.5149         
##                                          
##  Mcnemar's Test P-Value : 0.2425         
## 
## Statistics by Class:
## 
##                      Class: Budget Class: MidMarket Class: Premium
## Sensitivity                 0.7034           0.6034         0.7853
## Specificity                 0.9540           0.7858         0.7705
## Pos Pred Value              0.7221           0.6151         0.7687
## Neg Pred Value              0.9498           0.7774         0.7870
## Prevalence                  0.1453           0.3620         0.4927
## Detection Rate              0.1022           0.2184         0.3869
## Detection Prevalence        0.1416           0.3551         0.5033
## Balanced Accuracy           0.8287           0.6946         0.7779
# Store results
knn_accuracy <- knn_cm$overall['Accuracy']
cat("\nKNN Test Accuracy:", round(knn_accuracy, 4), "\n")
## 
## KNN Test Accuracy: 0.7076
cat("Optimal K:", model_knn$bestTune$k, "\n")
## Optimal K: 7
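As a sanity check on how caret derives the per-class statistics, the Budget-class figures above can be reproduced directly from the printed confusion-matrix counts (a sketch, separate from the modelling pipeline):

```r
# Recompute KNN's Budget-class metrics from the confusion-matrix counts above.
cm <- matrix(c(460, 148,  29,
               168, 983, 447,
                26, 498, 1741),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("Budget", "MidMarket", "Premium"),
                             Reference  = c("Budget", "MidMarket", "Premium")))

tp          <- cm["Budget", "Budget"]
sensitivity <- tp / sum(cm[, "Budget"])  # recall: TP over actual Budget listings
precision   <- tp / sum(cm["Budget", ])  # TP over predicted-Budget listings
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
round(c(sensitivity = sensitivity, precision = precision, f1 = f1), 4)
# sensitivity = 0.7034, precision = 0.7221 -- matching caret's output above
```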

8 Model Comparison and Evaluation

8.1 Performance Metrics Comparison

# Compile all model results
# Check if LDA was replaced with Naive Bayes
lda_model_name <- if(exists("model_lda") && model_lda$method == "naive_bayes") {
  "Naive Bayes"
} else {
  "LDA"
}

model_names <- c("Logistic Regression", "Random Forest", "SVM", lda_model_name, "KNN")
confusion_matrices <- list(logit_cm, rf_cm, svm_cm, lda_cm, knn_cm)

# Extract metrics for each model
metrics_df <- data.frame(
  Model = model_names,
  Accuracy = sapply(confusion_matrices, function(cm) cm$overall['Accuracy']),
  Kappa = sapply(confusion_matrices, function(cm) cm$overall['Kappa']),
  Sensitivity_Budget = sapply(confusion_matrices, function(cm) cm$byClass[1, 'Sensitivity']),
  Specificity_Budget = sapply(confusion_matrices, function(cm) cm$byClass[1, 'Specificity']),
  Precision_Budget = sapply(confusion_matrices, function(cm) cm$byClass[1, 'Pos Pred Value']),
  F1_Budget = sapply(confusion_matrices, function(cm) cm$byClass[1, 'F1']),
  Sensitivity_MidMarket = sapply(confusion_matrices, function(cm) cm$byClass[2, 'Sensitivity']),
  Specificity_MidMarket = sapply(confusion_matrices, function(cm) cm$byClass[2, 'Specificity']),
  Precision_MidMarket = sapply(confusion_matrices, function(cm) cm$byClass[2, 'Pos Pred Value']),
  F1_MidMarket = sapply(confusion_matrices, function(cm) cm$byClass[2, 'F1']),
  Sensitivity_Premium = sapply(confusion_matrices, function(cm) cm$byClass[3, 'Sensitivity']),
  Specificity_Premium = sapply(confusion_matrices, function(cm) cm$byClass[3, 'Specificity']),
  Precision_Premium = sapply(confusion_matrices, function(cm) cm$byClass[3, 'Pos Pred Value']),
  F1_Premium = sapply(confusion_matrices, function(cm) cm$byClass[3, 'F1'])
)

# Display comprehensive metrics table
print(kable(metrics_df, digits = 4, caption = "Comprehensive Model Performance Metrics"))
## 
## 
## Table: Comprehensive Model Performance Metrics
## 
## |Model               | Accuracy|  Kappa| Sensitivity_Budget| Specificity_Budget| Precision_Budget| F1_Budget| Sensitivity_MidMarket| Specificity_MidMarket| Precision_MidMarket| F1_MidMarket| Sensitivity_Premium| Specificity_Premium| Precision_Premium| F1_Premium|
## |:-------------------|--------:|------:|------------------:|------------------:|----------------:|---------:|---------------------:|---------------------:|-------------------:|------------:|-------------------:|-------------------:|-----------------:|----------:|
## |Logistic Regression |   0.7418| 0.5725|             0.7676|             0.9592|           0.7618|    0.7647|                0.6415|                0.8140|              0.6618|       0.6515|              0.8078|              0.7937|            0.7918|     0.7997|
## |Random Forest       |   0.7718| 0.6229|             0.7875|             0.9641|           0.7887|    0.7881|                0.6967|                0.8265|              0.6950|       0.6959|              0.8223|              0.8287|            0.8234|     0.8228|
## |SVM                 |   0.7456| 0.5787|             0.7462|             0.9566|           0.7450|    0.7456|                0.6483|                0.8161|              0.6667|       0.6573|              0.8169|              0.8029|            0.8010|     0.8088|
## |LDA                 |   0.6644| 0.4388|             0.4297|             0.9561|           0.6244|    0.5091|                0.6507|                0.6998|              0.5515|       0.5970|              0.7438|              0.7902|            0.7749|     0.7590|
## |KNN                 |   0.7076| 0.5149|             0.7034|             0.9540|           0.7221|    0.7126|                0.6034|                0.7858|              0.6151|       0.6092|              0.7853|              0.7705|            0.7687|     0.7769|
# Calculate macro-averaged metrics
metrics_df$Macro_F1 <- rowMeans(cbind(metrics_df$F1_Budget,
                                       metrics_df$F1_MidMarket,
                                       metrics_df$F1_Premium), na.rm = TRUE)

# Overall performance visualization
p1 <- ggplot(metrics_df, aes(x = reorder(Model, Accuracy), y = Accuracy, fill = Model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(Accuracy, 3)), hjust = -0.1, size = 3.5) +  # hjust, since coord_flip() swaps axes
  coord_flip() +
  labs(title = "Model Accuracy Comparison",
       x = "Model", y = "Accuracy") +
  theme_minimal() +
  theme(legend.position = "none") +
  ylim(0, 1)

p2 <- ggplot(metrics_df, aes(x = reorder(Model, Macro_F1), y = Macro_F1, fill = Model)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(Macro_F1, 3)), hjust = -0.1, size = 3.5) +  # hjust, since coord_flip() swaps axes
  coord_flip() +
  labs(title = "Macro-Averaged F1 Score Comparison",
       x = "Model", y = "Macro F1") +
  theme_minimal() +
  theme(legend.position = "none") +
  ylim(0, 1)

grid.arrange(p1, p2, ncol = 2)

# Class-specific performance visualization
f1_scores <- data.frame(
  Model = rep(model_names, 3),
  Category = rep(c("Budget", "MidMarket", "Premium"), each = 5),
  F1_Score = c(metrics_df$F1_Budget, metrics_df$F1_MidMarket, metrics_df$F1_Premium)
)

p3 <- ggplot(f1_scores, aes(x = Model, y = F1_Score, fill = Category)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "F1 Scores by Price Category",
       x = "Model", y = "F1 Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_fill_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c"))

print(p3)
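A short aside on why macro F1 is reported alongside accuracy: with the test-set class imbalance (prevalences of roughly 0.145 / 0.362 / 0.493), a degenerate rule that always predicts Premium already matches the No Information Rate on accuracy while its macro F1 collapses. A sketch using the class counts from the confusion matrices above:

```r
# Accuracy vs. macro F1 for an always-predict-Premium baseline,
# using the test-set class counts from the confusion matrices above.
n <- c(Budget = 654, MidMarket = 1629, Premium = 2217)

accuracy <- unname(n["Premium"] / sum(n))  # equals the No Information Rate

# Always-Premium: F1 is 0 for Budget and MidMarket; for Premium,
# recall = 1 and precision = the prevalence of Premium.
prec_premium <- unname(n["Premium"] / sum(n))
f1_premium   <- 2 * prec_premium * 1 / (prec_premium + 1)
macro_f1     <- mean(c(0, 0, f1_premium))

round(c(accuracy = accuracy, macro_f1 = macro_f1), 4)
# accuracy = 0.4927, macro_f1 = 0.2200
```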

8.2 Confusion Matrices Visualization

library(cvms)
library(tibble)

# Function to create confusion matrix plot
plot_confusion_matrix <- function(cm, title) {
  cm_table <- as.data.frame(cm$table)

  ggplot(cm_table, aes(x = Reference, y = Prediction, fill = Freq)) +
    geom_tile() +
    geom_text(aes(label = Freq), color = "white", size = 6, fontface = "bold") +
    scale_fill_gradient(low = "#3498db", high = "#e74c3c") +
    labs(title = title, x = "Actual Category", y = "Predicted Category") +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5, face = "bold"))
}

# Create confusion matrix plots for all models
cm1 <- plot_confusion_matrix(logit_cm, "Logistic Regression")
cm2 <- plot_confusion_matrix(rf_cm, "Random Forest")
cm3 <- plot_confusion_matrix(svm_cm, "SVM")
cm4 <- plot_confusion_matrix(lda_cm, "LDA")
cm5 <- plot_confusion_matrix(knn_cm, "KNN")

grid.arrange(cm1, cm2, cm3, cm4, cm5, ncol = 2)

8.3 ROC Curves and AUC Analysis

ROC curves provide insight into the trade-off between the true positive rate (sensitivity) and the false positive rate (1 - specificity) for each price category.
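One way to read a one-vs-rest AUC: it equals the probability that a randomly chosen positive listing is assigned a higher class probability than a randomly chosen negative one. A hand-worked sketch with made-up probabilities (not model output):

```r
# AUC as the Mann-Whitney pairwise-comparison probability (toy values).
is_premium <- c(1, 1, 0, 0, 0)                  # actual: Premium vs. rest
p_premium  <- c(0.90, 0.60, 0.70, 0.20, 0.10)   # hypothetical P(Premium)

pos <- p_premium[is_premium == 1]
neg <- p_premium[is_premium == 0]

# Compare every positive with every negative; ties count half
pairs <- expand.grid(pos = pos, neg = neg)
auc   <- mean(pairs$pos > pairs$neg) + 0.5 * mean(pairs$pos == pairs$neg)
auc  # 5 of 6 pairs ranked correctly: 0.8333
```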

library(pROC)
library(ggplot2)

# Function to calculate ROC for each class in multi-class problem
calculate_multiclass_roc <- function(predictions, actual, model_name) {
  roc_list <- list()
  auc_values <- c()

  # One-vs-Rest approach for each class
  classes <- levels(actual)

  for(class in classes) {
    # Create binary outcome: current class vs all others
    binary_actual <- ifelse(actual == class, 1, 0)
    class_prob <- predictions[, class]

    # Calculate ROC
    roc_obj <- roc(binary_actual, class_prob, quiet = TRUE)
    roc_list[[class]] <- roc_obj
    auc_values <- c(auc_values, auc(roc_obj))
  }

  return(list(roc_list = roc_list, auc_values = auc_values, classes = classes))
}

# Calculate ROC for all models
roc_logit <- calculate_multiclass_roc(logit_pred_prob, test_features$price_category, "Logistic Regression")
roc_rf <- calculate_multiclass_roc(rf_pred_prob, test_features$price_category, "Random Forest")
roc_svm <- calculate_multiclass_roc(svm_pred_prob, test_features$price_category, "SVM")
roc_lda <- calculate_multiclass_roc(lda_pred_prob, test_features$price_category, "LDA")
roc_knn <- calculate_multiclass_roc(knn_pred_prob, test_features$price_category, "KNN")

# Create ROC curve plot for each model
plot_roc_model <- function(roc_data, model_name) {
  plot_data <- data.frame()

  for(i in 1:length(roc_data$classes)) {
    class <- roc_data$classes[i]
    roc_obj <- roc_data$roc_list[[class]]
    auc_val <- roc_data$auc_values[i]

    temp_df <- data.frame(
      FPR = 1 - roc_obj$specificities,   # false positive rate
      TPR = roc_obj$sensitivities,       # true positive rate
      Class = paste0(class, " (AUC=", round(auc_val, 3), ")")
    )
    plot_data <- rbind(plot_data, temp_df)
  }

  ggplot(plot_data, aes(x = FPR, y = TPR, color = Class)) +
    geom_line(linewidth = 1) +
    geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "gray50") +
    labs(title = paste("ROC Curves -", model_name),
         x = "False Positive Rate (1 - Specificity)",
         y = "True Positive Rate (Sensitivity)") +
    theme_minimal() +
    theme(legend.position = "bottom") +
    coord_equal() +
    xlim(0, 1) + ylim(0, 1)
}

# Create plots for all models
p_roc1 <- plot_roc_model(roc_logit, "Logistic Regression")
p_roc2 <- plot_roc_model(roc_rf, "Random Forest")
p_roc3 <- plot_roc_model(roc_svm, "SVM")
p_roc4 <- plot_roc_model(roc_lda, lda_model_name)
p_roc5 <- plot_roc_model(roc_knn, "KNN")

grid.arrange(p_roc1, p_roc2, p_roc3, p_roc4, p_roc5, ncol = 2)

# Summary table of AUC values
auc_summary <- data.frame(
  Model = c("Logistic Regression", "Random Forest", "SVM", lda_model_name, "KNN"),
  AUC_Budget = c(roc_logit$auc_values[1], roc_rf$auc_values[1], roc_svm$auc_values[1],
                 roc_lda$auc_values[1], roc_knn$auc_values[1]),
  AUC_MidMarket = c(roc_logit$auc_values[2], roc_rf$auc_values[2], roc_svm$auc_values[2],
                    roc_lda$auc_values[2], roc_knn$auc_values[2]),
  AUC_Premium = c(roc_logit$auc_values[3], roc_rf$auc_values[3], roc_svm$auc_values[3],
                  roc_lda$auc_values[3], roc_knn$auc_values[3])
)

auc_summary$Mean_AUC <- rowMeans(auc_summary[, 2:4])

print(kable(auc_summary, digits = 4, caption = "AUC Values by Model and Price Category"))
## 
## 
## Table: AUC Values by Model and Price Category
## 
## |Model               | AUC_Budget| AUC_MidMarket| AUC_Premium| Mean_AUC|
## |:-------------------|----------:|-------------:|-----------:|--------:|
## |Logistic Regression |     0.9584|        0.8189|      0.8901|   0.8891|
## |Random Forest       |     0.9676|        0.8522|      0.9108|   0.9102|
## |SVM                 |     0.9579|        0.8224|      0.8960|   0.8921|
## |LDA                 |     0.8935|        0.7582|      0.8427|   0.8315|
## |KNN                 |     0.9227|        0.7722|      0.8574|   0.8508|
cat("\nROC Curve Interpretation:\n")
## 
## ROC Curve Interpretation:
cat("- AUC = 1.0: Perfect classification\n")
## - AUC = 1.0: Perfect classification
cat("- AUC = 0.5: Random guessing (diagonal line)\n")
## - AUC = 0.5: Random guessing (diagonal line)
cat("- AUC > 0.8: Generally considered excellent\n")
## - AUC > 0.8: Generally considered excellent
cat("- AUC 0.7-0.8: Good classification performance\n")
## - AUC 0.7-0.8: Good classification performance

8.4 Best Model Selection and Interpretation

# Identify best model
best_model_idx <- which.max(metrics_df$Accuracy)
best_model_name <- metrics_df$Model[best_model_idx]
best_accuracy <- metrics_df$Accuracy[best_model_idx]

cat("\n========================================\n")
## 
## ========================================
cat("BEST MODEL:", best_model_name, "\n")
## BEST MODEL: Random Forest
cat("Test Accuracy:", round(best_accuracy, 4), "\n")
## Test Accuracy: 0.7718
cat("Macro F1 Score:", round(metrics_df$Macro_F1[best_model_idx], 4), "\n")
## Macro F1 Score: 0.7689
cat("========================================\n\n")
## ========================================
# Class-specific performance for best model
cat("Class-Specific Performance:\n")
## Class-Specific Performance:
cat("Budget:\n")
## Budget:
cat("  - Sensitivity (Recall):", round(metrics_df$Sensitivity_Budget[best_model_idx], 4), "\n")
##   - Sensitivity (Recall): 0.7875
cat("  - Precision:", round(metrics_df$Precision_Budget[best_model_idx], 4), "\n")
##   - Precision: 0.7887
cat("  - F1 Score:", round(metrics_df$F1_Budget[best_model_idx], 4), "\n\n")
##   - F1 Score: 0.7881
cat("MidMarket:\n")
## MidMarket:
cat("  - Sensitivity (Recall):", round(metrics_df$Sensitivity_MidMarket[best_model_idx], 4), "\n")
##   - Sensitivity (Recall): 0.6967
cat("  - Precision:", round(metrics_df$Precision_MidMarket[best_model_idx], 4), "\n")
##   - Precision: 0.695
cat("  - F1 Score:", round(metrics_df$F1_MidMarket[best_model_idx], 4), "\n\n")
##   - F1 Score: 0.6959
cat("Premium:\n")
## Premium:
cat("  - Sensitivity (Recall):", round(metrics_df$Sensitivity_Premium[best_model_idx], 4), "\n")
##   - Sensitivity (Recall): 0.8223
cat("  - Precision:", round(metrics_df$Precision_Premium[best_model_idx], 4), "\n")
##   - Precision: 0.8234
cat("  - F1 Score:", round(metrics_df$F1_Premium[best_model_idx], 4), "\n\n")
##   - F1 Score: 0.8228
# Model insights
cat("\nKey Insights:\n")
## 
## Key Insights:
cat("- Four of the five models achieved >70% accuracy, demonstrating that Airbnb pricing patterns are learnable\n")
## - Four of the five models achieved >70% accuracy, demonstrating that Airbnb pricing patterns are learnable
cat("- Random Forest performs best, likely due to its ability to capture non-linear feature interactions\n")
## - Random Forest performs best, likely due to its ability to capture non-linear feature interactions
cat("- Geographic features (distance_from_cbd, neighbourhood) appear critical for classification\n")
## - Geographic features (distance_from_cbd, neighbourhood) appear critical for classification
cat("- Property characteristics (bedrooms, accommodates) strongly differentiate price tiers\n")
## - Property characteristics (bedrooms, accommodates) strongly differentiate price tiers
cat("- MidMarket is the hardest category to classify, likely due to overlap with adjacent price bands\n")
## - MidMarket is the hardest category to classify, likely due to overlap with adjacent price bands

9 Introduced Business Innovation and Expected Outcomes

  1. Predictive Insights:
    • Identify key drivers of premium pricing in the Sydney area
    • Quantify the impact of location vs. property characteristics on nightly prices
    • Understand the effect of host quality on price premiums
  2. Market Segmentation:
    • Clear segmentation of the Sydney accommodation market
    • Neighborhood-specific pricing pattern variations
    • Property type optimization strategies
  3. Policy Implications:
    • Evidence base for short-term rental regulation
    • Impact assessment for Sydney's housing affordability crisis
    • Insights for tourism industry planning and budgeting
  4. Business Applications:
    • Investment guidance for property owners
    • Pricing optimization for hosts
    • Market entry strategies for new listings
  5. Technical Contributions:
    • Comparative analysis of ML algorithms on Sydney data
    • Feature importance insights for accommodation pricing
    • Geographic modeling approaches for real estate markets

10 Conclusion

This analysis establishes a robust foundation for understanding Sydney’s short-term rental market through the lens of data science. Our comprehensive examination of 15,000+ Airbnb properties reveals clear market segmentation patterns that reflect broader Australian housing dynamics.

Key Findings

  1. Market Structure: Sydney’s accommodation market demonstrates distinct pricing tiers, with premium properties concentrated in iconic locations. The data reveals that location, amenities, and host quality are primary drivers of pricing power.

  2. Data Quality: Through systematic cleaning and feature engineering, we transformed raw listing data into a modeling-ready dataset with 19 carefully selected features. Missing data patterns were strategically addressed using domain knowledge, achieving 100% data completeness.

  3. Geographic Insights: Distance from Sydney’s CBD emerges as a critical pricing factor, while neighborhood-specific patterns highlight the premium commanded by waterfront and central locations.


11 Appendix

11.1 References

  1. Inside Airbnb. (2025). Sydney, New South Wales, Australia Dataset. Retrieved from http://insideairbnb.com/get-the-data/

  2. Cox, M. (2024). Inside Airbnb: Adding Data to the Debate. Retrieved from http://insideairbnb.com/about.html

  3. Australian Bureau of Statistics. (2023). Housing Occupancy and Costs. Retrieved from https://www.abs.gov.au/

  4. NSW Government. (2024). Short-term Rental Accommodation Industry in NSW. Retrieved from https://www.nsw.gov.au/

  5. James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning with Applications in R (2nd ed.). Springer.

  6. Dhummad, S. (2025). The Imperative of Exploratory Data Analysis in Machine Learning. Scholars Journal of Engineering and Technology, 13.

  7. Katyal, A., Sharma, P. K., & Kannan, M. (2025). Exploratory Data Analysis (EDA) on Undergraduate Data Science Students Through R Programming.

  8. Michelucci, U. (2025). Data Visualisation. In Statistics for Scientists: A Concise Guide for Data-driven Research (pp. 109-119). Cham: Springer Nature Switzerland.


11.2 Data Dictionary

# Creating a comprehensive data dictionary
data_dict <- data.frame(
  Variable = c("price_category", "accommodates", "bedrooms", "bathrooms",
               "property_type", "room_type", "neighbourhood_cleansed",
               "latitude", "longitude", "host_is_superhost", "host_response_rate",
               "host_listings_count", "review_scores_rating", "number_of_reviews",
               "availability_365", "minimum_nights", "amenities_count",
               "distance_from_cbd", "is_popular_area", "property_size"),
 
  Type = c("Categorical", "Numeric", "Numeric", "Numeric",
           "Categorical", "Categorical", "Categorical",
           "Numeric", "Numeric", "Logical", "Numeric",
           "Numeric", "Numeric", "Numeric",
           "Numeric", "Numeric", "Numeric",
           "Numeric", "Logical", "Categorical"),
 
  Description = c("Target variable: Budget (<$100), Mid-Market ($100-200), Premium (>$200)",
                  "Maximum number of guests property can accommodate",
                  "Number of bedrooms available",
                  "Number of bathrooms available",
                  "Type of property (Apartment, House, etc.)",
                  "Type of rental (Entire home, Private room, Shared room)",
                  "Sydney neighbourhood/suburb name",
                  "Geographic latitude coordinate",
                  "Geographic longitude coordinate",
                  "Whether host has Superhost status",
                  "Host response rate as proportion (0-1)",
                  "Number of listings managed by host",
                  "Average review score rating (1-5 scale)",
                  "Total number of reviews received",
                  "Days available for booking per year",
                  "Minimum nights required for booking",
                  "Number of amenities provided",
                  "Calculated distance from Sydney CBD",
                  "Whether in popular tourist area",
                  "Property size category based on capacity")
)

kable(data_dict, caption = "Complete Data Dictionary for Model Features")
Table: Complete Data Dictionary for Model Features

|Variable               |Type        |Description                                                             |
|:----------------------|:-----------|:-----------------------------------------------------------------------|
|price_category         |Categorical |Target variable: Budget (<$100), Mid-Market ($100-200), Premium (>$200) |
|accommodates           |Numeric     |Maximum number of guests property can accommodate                       |
|bedrooms               |Numeric     |Number of bedrooms available                                            |
|bathrooms              |Numeric     |Number of bathrooms available                                           |
|property_type          |Categorical |Type of property (Apartment, House, etc.)                               |
|room_type              |Categorical |Type of rental (Entire home, Private room, Shared room)                 |
|neighbourhood_cleansed |Categorical |Sydney neighbourhood/suburb name                                        |
|latitude               |Numeric     |Geographic latitude coordinate                                          |
|longitude              |Numeric     |Geographic longitude coordinate                                         |
|host_is_superhost      |Logical     |Whether host has Superhost status                                       |
|host_response_rate     |Numeric     |Host response rate as proportion (0-1)                                  |
|host_listings_count    |Numeric     |Number of listings managed by host                                      |
|review_scores_rating   |Numeric     |Average review score rating (1-5 scale)                                 |
|number_of_reviews      |Numeric     |Total number of reviews received                                        |
|availability_365       |Numeric     |Days available for booking per year                                     |
|minimum_nights         |Numeric     |Minimum nights required for booking                                     |
|amenities_count        |Numeric     |Number of amenities provided                                            |
|distance_from_cbd      |Numeric     |Calculated distance from Sydney CBD                                     |
|is_popular_area        |Logical     |Whether in popular tourist area                                         |
|property_size          |Categorical |Property size category based on capacity                                |

11.3 Extra Graphs

Geographic Analysis

# Geographic distribution
ggplot(listings, aes(x = longitude, y = latitude, color = price_category)) +
  geom_point(alpha = 0.6, size = 0.8) +
  labs(title = "Geographic Distribution of Properties by Price Category",
       subtitle = "Sydney Airbnb listings colored by price segment",
       x = "Longitude", y = "Latitude") +
  theme_minimal() +
  scale_color_manual(values = c("Budget" = "#2ecc71", "MidMarket" = "#f39c12", "Premium" = "#e74c3c")) +
  guides(color = guide_legend(override.aes = list(size = 3, alpha = 1)))


Neighbourhood Analysis

# Top neighbourhoods by count
top_neighbourhoods <- listings %>%
  count(neighbourhood_cleansed, sort = TRUE) %>%
  head(15)

p7 <- ggplot(top_neighbourhoods, aes(x = reorder(neighbourhood_cleansed, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 15 Sydney Neighbourhoods by Listing Count",
       x = "Neighbourhood", y = "Number of Listings") +
  theme_minimal()

# Median price by neighbourhood 
neighbourhood_price <- listings %>%
  filter(neighbourhood_cleansed %in% top_neighbourhoods$neighbourhood_cleansed) %>%
  group_by(neighbourhood_cleansed) %>%
  summarise(
    count = n(),
    median_price = median(price_numeric),
    premium_pct = mean(price_category == "Premium") * 100
  ) %>%
  arrange(desc(median_price))

p8 <- ggplot(neighbourhood_price, aes(x = reorder(neighbourhood_cleansed, median_price),
                                     y = median_price)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(title = "Median Price by Neighbourhood",
       subtitle = "Top 15 neighbourhoods by listing count",
       x = "Neighbourhood", y = "Median Price (AUD)") +
  theme_minimal() +
  scale_y_continuous(labels = dollar_format(prefix = "$"))

grid.arrange(p7, p8, ncol = 1)

This analysis was conducted as part of STAT5003 Computational Statistical Methods coursework, focusing on real-world application of machine learning techniques to Australian housing market data. The report has been prepared with the assistance of artificial intelligence (AI) tools. AI was used to support tasks such as research support, grammar correction and clarity improvement. All content has been reviewed and verified by the team to ensure accuracy, relevance and alignment with project objectives.